Introduction to Deep Reinforcement Learning (DRL)
EvoClass-AI003 Lecture 9

Deep Reinforcement Learning (DRL) merges the high-dimensional representation capabilities of Deep Neural Networks with the optimal control framework of Reinforcement Learning. Unlike supervised or unsupervised learning, DRL agents learn through trial-and-error interaction within a dynamic environment, making sequential decisions without immediate, explicit labels. This integration allows agents to handle complex, raw inputs (like pixel data) directly.

1. The DRL Learning Paradigm

The RL agent operates in a continuous loop: observing the environment State ($S_t$), performing an Action ($A_t$), and receiving a potentially sparse or delayed scalar Reward ($R_{t+1}$). The primary challenge is the credit assignment problem: determining which past actions were responsible for a future reward signal.
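The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `CorridorEnv`, `run_episode`, and both policies are hypothetical names invented for this example, and the environment is a toy corridor in which the only nonzero reward arrives at the final step.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right (positions clipped to [0, length])
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.length, self.state + move))
        done = self.state == self.length
        reward = 1.0 if done else 0.0  # sparse, delayed reward: only at the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run the observe -> act -> reward loop and record (S_t, A_t, R_{t+1}) tuples."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                       # choose A_t from S_t
        next_state, reward, done = env.step(action)  # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


def random_policy(state):
    return random.choice([0, 1])


def always_right(state):
    return 1


trajectory = run_episode(CorridorEnv(length=5), always_right)
for s, a, r in trajectory:
    print(f"S={s} A={a} R={r}")
```

Every step before the last one returns a reward of zero, so the reward alone does not say which earlier actions mattered. That is the credit assignment problem in miniature.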

2. The Optimization Objective

The ultimate goal is to discover an optimal policy ($\pi^*$): a mapping from states to actions (or, more generally, to a distribution over actions) that maximizes the expected value of the cumulative discounted return ($G_t$). The discount factor ($\gamma \in [0, 1]$) sets how much the agent values immediate rewards relative to rewards far in the future; with $\gamma < 1$ and bounded rewards, it also keeps the infinite sum finite.

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
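For a finite list of rewards, the sum above can be evaluated directly. The sketch below assumes the episode ends after the last listed reward (all later rewards are treated as zero); the function name `discounted_return` is chosen for this example. It uses the equivalent recursion $G_t = R_{t+1} + \gamma G_{t+1}$, working backwards from the end of the episode.

```python
def discounted_return(rewards, gamma):
    """Compute G_t for rewards = [R_{t+1}, R_{t+2}, ...] (finite episode assumed)."""
    g = 0.0
    # Work backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```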
Question 1
How does the DRL agent receive feedback from the environment?
Explicit labels/targets
Backpropagation through time
Scalar reward signal
Labeled demonstration data
Question 2
What does the policy ($\pi$) mathematically represent?
The predicted total reward
A distribution over actions given a state
The probability of transitioning to a new state
The error between predicted and actual returns
Challenge: The Discount Factor
Analyzing the Temporal Horizon.
Consider two scenarios:
1. $\gamma = 0$
2. $\gamma \approx 1$

Describe the agent's behavioral preference in each case regarding the timeline of rewards.
Step 1
How does the choice of $\gamma$ affect the policy's horizon?
Solution:
If $\gamma = 0$, the agent is myopic (shortsighted): the return reduces to $G_t = R_{t+1}$, so only the immediate reward matters. If $\gamma \approx 1$, the agent is far-sighted: distant rewards are weighted almost as heavily as immediate ones, which leads to planning over a very long horizon.
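A small numerical comparison makes the two cases concrete. The reward sequences below are invented for illustration: one option pays a small reward immediately, the other pays a larger reward four steps later. The helper names are also assumptions made for this sketch.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} over a finite reward list."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Hypothetical choices: small reward now vs. larger reward later.
immediate = [1.0, 0.0, 0.0, 0.0, 0.0]
delayed = [0.0, 0.0, 0.0, 0.0, 10.0]


def preferred(gamma):
    """Return which option has the higher discounted return for this gamma."""
    g_immediate = discounted_return(immediate, gamma)
    g_delayed = discounted_return(delayed, gamma)
    return "immediate" if g_immediate > g_delayed else "delayed"


for gamma in (0.0, 0.99):
    print(gamma, preferred(gamma))
```

With $\gamma = 0$ the delayed option is worth nothing, so the myopic agent takes the small immediate reward. With $\gamma = 0.99$ the delayed reward is discounted only by $0.99^4$, so it keeps most of its value and the far-sighted agent prefers to wait.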